BACKGROUND OF THE INVENTION
1. Field of the Invention 

This invention relates to computer architecture. 
In particular, this invention relates to the design of 
an instruction unit in a superscalar processor. 



2. Discussion of the Related Art 

Parallelism is extensively exploited in modern
computer designs. Among these designs are two distinct
architectures which are known respectively as the very
long instruction word (VLIW) architecture and the
superscalar architecture. A superscalar processor is a
computer which can dispatch one, two or more
instructions simultaneously. Such a processor
typically includes multiple functional units which can
independently execute the dispatched instructions. In
such a processor, a control logic circuit, which has
come to be known as the "grouping logic" circuit,
determines the instructions to dispatch (the
"instruction group"), according to certain resource
allocation and data dependency constraints. The task
of the computer designer is to provide a grouping logic
circuit which can dynamically evaluate such constraints
to dispatch instruction groups which optimally use the
available resources. A resource allocation constraint
can be, for instance, in a computer with a single
floating point multiplier unit, the constraint that no
more than one floating point multiply instruction is to
be dispatched in any given processor cycle. A
processor cycle is the basic timing unit for a
pipelined unit of the processor, typically the clock
period of the CPU clock. An example of a data
dependency constraint is the avoidance of a
"read-after-write" hazard. This constraint prevents
dispatching an instruction which requires an operand
from a register which is the destination of a write
instruction dispatched earlier but not yet retired.
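
The two kinds of constraint described above can be illustrated with a minimal Python sketch. It is not the circuit of any particular processor: the names Instr and form_group, the unit labels, and the choice to stop at the first instruction that cannot be dispatched (in-order dispatch) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Instr:
    unit: str               # functional unit class needed, e.g. "fmul" (hypothetical label)
    srcs: Tuple[str, ...]   # source register names
    dest: str               # destination register name


def form_group(candidates: List[Instr],
               pending_dests: Set[str],
               free_units: Dict[str, int],
               width: int = 4) -> List[Instr]:
    """Select, in program order, the instructions that may dispatch together."""
    group: List[Instr] = []
    free = dict(free_units)       # copy so the caller's table is not modified
    written = set(pending_dests)  # destinations of instructions not yet retired
    for ins in candidates:
        if len(group) == width:
            break
        # Resource allocation constraint: a unit of the required class must be free.
        if free.get(ins.unit, 0) == 0:
            break
        # Data dependency constraint: avoid a read-after-write hazard.
        if any(src in written for src in ins.srcs):
            break
        free[ins.unit] -= 1
        written.add(ins.dest)
        group.append(ins)
    return group
```

With a single floating point multiplier, a second multiply instruction among the candidates ends the group, which corresponds to the resource allocation constraint in the example above.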

A VLIW processor, unlike a superscalar processor, 
does not dynamically allocate system resources at run 
time. Rather, resource allocation and data dependency 
analysis are performed during program compilation. A 
VLIW processor decodes the long instruction word to 
provide the control information for operating the 
various independent functional units. The task of the 
compiler is to optimize performance of a program by 
generating a sequence of such instructions which, when
decoded, efficiently exploit the program's inherent 
parallelism in the computer's parallel hardware. The 
hardware is given little control of instruction 
sequencing and dispatch. 

A VLIW computer, however, has a significant 
drawback in that its programs must be recompiled for 
each machine they run on. Such recompilation is 
required because the control information required by 
each machine is encoded in the instruction words. A 
superscalar computer, by contrast, is often designed to 
be able to run existing executable programs (i.e.,
"binaries"). In a superscalar computer, the
instructions of an existing executable program are
dispatched by the computer at run time according to the
computer's particular resource availability and data
integrity requirements. From a computer user's point
of view, because existing binaries represent
significant investments, the ability to acquire
enhanced performance without the expense of purchasing
new copies of binaries is a significant advantage.

In the prior art, to determine the instructions 
that go into an instruction group of a given processor 
cycle, a superscalar computer performs the resource
allocation and data dependency checking tasks in the 
immediately preceding processor cycle. Under this 
scheme, the computer designer must ensure that such 
resource allocation and data dependency checking tasks 
complete within their processor cycle. As the number 
of the functional units that can be independently run 
increases, the time required for performing such 
resource allocation and data dependency checking tasks 
grows more rapidly than linearly. Consequently, in a
superscalar computer design, the ability to perform
resource and data integrity analysis within a single
processor cycle can become a factor that limits the 
performance gain of additional parallelism. 

SUMMARY OF THE INVENTION

The present invention provides a central
processing unit which includes a grouping logic circuit
for determining simultaneously dispatchable
instructions in a processor cycle. The central
processing unit of the present invention includes such
a grouping logic circuit and a number of functional
units, each adapted to execute one or more specified
instructions dispatched by the grouping logic circuit.
The grouping logic circuit includes a number of
pipeline stages, such that resource allocation and data
dependency checks can be performed over a number of
processor cycles. The present invention therefore
allows dispatching a large number of instructions
simultaneously, while preventing the complexity of the
grouping logic circuit from limiting the duration of
the central processing unit's processor cycle.

In one embodiment, the grouping logic circuit
checks intra-group data dependency immediately upon
receiving the instruction group. In that embodiment,
all instructions in a group of instructions received in
a first processor cycle are dispatched prior to
dispatching any instruction of a second group of
instructions received in a processor cycle subsequent
to said first processor cycle.

The present invention is better understood upon
consideration of the detailed description below in
conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a block diagram of a CPU 100, in an
exemplary 4-way superscalar processor of the present
invention.

Figure 2 shows schematically a 4-stage pipelined
grouping logic circuit 109 in the 4-way superscalar
processor of Figure 1.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

An embodiment of the present invention is
illustrated by the block diagram of Figure 1, which
shows a central processing unit (CPU) 100 in an
exemplary 4-way superscalar processor of the present
invention. A 4-way superscalar processor fetches,
dispatches, executes and retires up to four
instructions per processor cycle. As shown in Figure
1, central processing unit 100 includes two arithmetic
logic units 101 and 102, a load/store unit 103, which
includes a 9-deep load buffer 104 and an 8-deep store
buffer 105, a floating point adder 106, a floating
point multiplier 107, and a floating point divider 108.
In this embodiment, a grouping logic circuit 109
dispatches up to four instructions per processor cycle.
Completion unit 110 retires instructions upon
completion. A register file (not shown), including
numerous integer and floating point registers, is
provided with a sufficient number of ports to prevent
contention among functional units for access to this
register file during operand fetch or result
write-back. In this embodiment also, loads are
non-blocking, i.e., CPU 100 continues to execute even
though one or more dispatched load instructions have
not completed. When the data of the load instructions
are returned from the main memory, these data can be
placed in a pipeline for storage in a second-level
cache. In this embodiment, floating point adder 106
and floating point multiplier 107 each have a 4-stage
pipeline. Similarly, load/store unit 103 has a 2-stage
pipeline. Floating point divider 108, which is not
pipelined, requires more than one processor cycle per
instruction.

To simplify the discussion below, the state of CPU
100 relevant to grouping logic 109 is summarized by a
state variable S(t), which is defined below. Of
course, the state of CPU 100 includes also other
variables, such as those conventionally included in the
processor status word. Those skilled in the art would
appreciate the use and implementation of processor
states. Thus, the state S(t) at time t of CPU 100 can
be represented by:

S(t) = {ALU_1(t), ALU_2(t), LS(t), LB(t), SB(t),
        FA(t), FM(t), FDS(t)}

where ALU_1(t) and ALU_2(t) are the states, at time t,
of arithmetic logic units 101 and 102
respectively; LS(t), LB(t) and SB(t) are the
states, at time t, of load/store unit 103,
load buffer 104 and store buffer 105
respectively; FA(t), FM(t), and FDS(t) are
the states, at time t, of floating point
adder 106, floating point multiplier 107 and
floating point divider 108 respectively.

At any given time, the state of each functional 
unit can be represented by the source and destination 
registers specified in the instructions dispatched to 
the functional unit but not yet retired. Thus, 

ALU_1(t) = {ALU_1.rs1(t), ALU_1.rs2(t), ALU_1.rd(t)}

where rs1(t), rs2(t) and rd(t) are respectively the
first and second source registers, and the
destination register, of the instruction
executing at time t in arithmetic logic unit
101.

Similarly, the state of arithmetic logic unit 102
can be defined as:

ALU_2(t) = {ALU_2.rs1(t), ALU_2.rs2(t), ALU_2.rd(t)}



For pipelined functional units, such as floating
point adder 106, the state is relatively more complex,
consisting of the source and destination registers of
the instructions in their respective pipelines. Thus,
for the pipelined units, i.e., load/store unit 103,
load buffer 104, store buffer 105, floating point adder
106, and floating point multiplier 107, their
respective states, at time t, LS(t), LB(t), SB(t),
FA(t) and FM(t) can be represented by:

LS = {LS.rs1_i(t), LS.rs2_i(t), LS.rd_i(t)} for i = {1, 2}

LB = {LB.rs1_i(t), LB.rs2_i(t), LB.rd_i(t)} for i = {1, 2, ..., 9}

SB = {SB.rs1_i(t), SB.rs2_i(t), SB.rd_i(t)} for i = {1, 2, ..., 8}

FA = {FA.rs1_i(t), FA.rs2_i(t), FA.rd_i(t)} for i = {1, 2, 3, 4}

FM = {FM.rs1_i(t), FM.rs2_i(t), FM.rd_i(t)} for i = {1, 2, 3, 4}

Finally, floating point divider 108's state FDS(t)
can be represented by:

FDS = {FDS.rs1(t), FDS.rs2(t), FDS.rd(t)}

State variable S(t) can be represented by a memory
element, such as a register or a content addressable
memory unit, at either a centralized location or in a
distributed fashion. For example, in the distributed
approach, the portion of state S(t) associated with a
given functional unit can be implemented with the
control logic of the functional unit.
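
As an illustration of the centralized alternative, the following Python sketch models S(t) as one record whose fields follow the components and depths given above (2 stages for the load/store unit, 9 and 8 entries for the load and store buffers, 4 stages for the floating point adder and multiplier). The names State, Entry and pending_destinations, and the use of None for an empty slot, are hypothetical choices for this sketch rather than part of the described hardware.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

# One in-flight instruction: (rs1, rs2, rd). None marks an empty slot.
Entry = Optional[Tuple[str, str, str]]


def _empty(depth: int):
    """Build a default_factory producing `depth` empty slots."""
    return lambda: [None] * depth


@dataclass
class State:
    alu1: Entry = None                                      # ALU 101
    alu2: Entry = None                                      # ALU 102
    ls: List[Entry] = field(default_factory=_empty(2))      # load/store unit 103
    lb: List[Entry] = field(default_factory=_empty(9))      # load buffer 104
    sb: List[Entry] = field(default_factory=_empty(8))      # store buffer 105
    fa: List[Entry] = field(default_factory=_empty(4))      # FP adder 106
    fm: List[Entry] = field(default_factory=_empty(4))      # FP multiplier 107
    fds: Entry = None                                       # FP divider 108

    def entries(self) -> List[Tuple[str, str, str]]:
        """All instructions dispatched but not yet retired."""
        singles = [self.alu1, self.alu2, self.fds]
        queues = self.ls + self.lb + self.sb + self.fa + self.fm
        return [e for e in singles + queues if e is not None]

    def pending_destinations(self) -> Set[str]:
        """Destination registers that a later instruction must not read yet."""
        return {rd for (_, _, rd) in self.entries()}
```

A grouping circuit consulting such a record would compare the source registers of each candidate instruction against pending_destinations() to detect a read-after-write hazard.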

In the prior art, a grouping logic circuit would
determine from the current state, S(t) at time t, the
next state S(t+1), which includes information necessary
to dispatch the instructions of the next processor
cycle at time t+1. For example, to avoid a
read-after-write hazard, such a grouping circuit would
exclude from the next state S(t+1) an instruction
having an operand to be fetched from a register
designated for storing a result of a yet incomplete
instruction. As another example, such a grouping
circuit would include in state S(t+1) no more than one
floating point "add" instruction in each processor
cycle, since only one floating point adder (i.e.
floating point adder 106) is available. As discussed
above, as complexity increases, the time required for
propagating through the grouping logic circuit can
become a critical path for the processor cycle. Thus,
in accordance with the present invention, grouping
logic circuit 109 is pipelined to derive, over r
processor cycles, a future state S(t+r) based on the
present state S(t). The future state S(t+r) determines
the instruction group to dispatch at time t+r.
Pipelining grouping logic 109 is possible because, as
demonstrated below, (i) the values of most state
variables in the state S(t+r) can be estimated from
corresponding values of state S(t) with sufficient
accuracy, and (ii) for those state variables for which
values cannot be accurately predicted, it is
relatively straightforward to provide for all possible
outcomes of state S(t+r), or to use a conservative
approach (i.e. not dispatching an instruction when such
an instruction could have been dispatched) with a
slight penalty on performance.

The process for predicting state S(t+r) is
explained next. The following discussion will first
show that most components of next state S(t+1) can be
precisely determined from present state S(t), and the
remaining components of state S(t+1) can be reasonably
determined, provided that certain non-deterministic
conditions are appropriately handled. By induction, it
can therefore be shown that future state S(t+r), where
r is greater than 1, can likewise be determined from
state S(t).

Since an instruction in floating point adder 106
or floating point multiplier 107 completes after four
processor cycles and an instruction in load/store unit
103 completes after two processor cycles, the states
FA, FM and LS at time t+1 can be derived from the
corresponding states at time t, the immediately
preceding processor cycle. In particular, the
relationships governing the source and destination
registers of each instruction executing in floating
point adder 106, floating point multiplier 107 and
load/store unit 103 between time t and time t+1 are:

rs1_i(t+1) = rs1_{i-1}(t), for 1 < i <= k

rs2_i(t+1) = rs2_{i-1}(t), for 1 < i <= k

rd_i(t+1) = rd_{i-1}(t), for 1 < i <= k

where k is the depth of the respective pipeline.
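
The shift relationship for a k-stage pipeline can be written as a short Python function. This is a sketch only: the name advance, the list representation (index 0 holding stage 1) and the issued argument are assumptions; the text above specifies only stages 2 through k, so what enters stage 1 is left to the caller.

```python
from typing import List, Optional, Tuple

# (rs1, rs2, rd) of the instruction in a stage, or None if the stage is empty.
Entry = Optional[Tuple[str, str, str]]


def advance(stages: List[Entry], issued: Entry = None) -> List[Entry]:
    """Return the pipeline state at time t+1 given its state at time t.

    stages[0] is stage 1 and stages[-1] is stage k. For 1 < i <= k,
    stage i at t+1 holds what stage i-1 held at t. Stage 1 receives
    `issued` (assumption: the instruction dispatched to this unit, if any).
    The entry in stage k at time t leaves the pipeline.
    """
    return [issued] + stages[:-1]


# Example: the 4-stage floating point adder with three instructions in flight.
fa_t = [("f0", "f1", "f2"), None, ("f4", "f5", "f6"), ("f8", "f9", "f10")]
fa_t1 = advance(fa_t)
```

Because the function needs nothing beyond the state at time t and the instruction being issued, it captures why these components of S(t+1) are precisely determined.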

The state FDS(t+1) of floating point divider 108,
in which the time required to execute an instruction
can exceed one processor cycle, is determined from state
FDS(t) by:

FDS(t+1) = FDS(t), if floating point divider 108 is not
           in its last stage at time t; null, otherwise.

Whether or not floating point divider 108 is in its
last stage can be determined from, for example, a
hardware counter or a state register, which keeps track
of the number of processor cycles elapsed since the
instruction in floating point divider 108 began
execution.
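
A counter-based version of this rule might look like the following Python sketch. The function name next_fds, the latency parameter (the number of cycles a divide occupies the unit, which the text does not specify) and the convention that the counter holds the cycles already completed are all assumptions for illustration.

```python
from typing import Optional, Tuple

Entry = Optional[Tuple[str, str, str]]  # (rs1, rs2, rd), or None when idle


def next_fds(fds: Entry, cycles_elapsed: int, latency: int) -> Tuple[Entry, int]:
    """Return (state at t+1, counter at t+1) for the unpipelined divider.

    cycles_elapsed counts the cycles completed before time t, so the
    instruction is in its last stage at time t when
    cycles_elapsed == latency - 1.
    """
    if fds is None:
        return None, 0                      # idle divider stays idle
    if cycles_elapsed + 1 >= latency:
        return None, 0                      # last stage: divider is free at t+1
    return fds, cycles_elapsed + 1          # otherwise the state is unchanged


# Example with a hypothetical 3-cycle divide.
state, count = ("f0", "f1", "f2"), 0
state, count = next_fds(state, count, latency=3)
```
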

In load buffer 104 and store buffer 105, since the
pending read or write operation at the head of each
queue need not complete within one processor cycle, the
state LB(t+1) at time t+1 cannot be determined from the
immediately previous state LB(t) at time t with
certainty. However, since state LB(t+1) can only
either remain the same, or reflect the movement of the
pipeline by one stage, two possible approaches to
determine state LB(t+1) can be used. First, a
conservative approach would predict LB(t+1) to be the
same as LB(t). Under this approach, when load buffer
104 is full, an instruction is not dispatched until the
pipeline in load buffer 104 advances. Upon an incorrect
prediction, i.e. when a load instruction completes
during the processor cycle of time t, this conservative
approach leads to a penalty of one processor cycle,
since a load instruction could have been dispatched at
time t+1. Alternatively, a more aggressive approach
provides for both outcomes, i.e. load buffer 104
advances one stage, and load buffer 104 remains the
same. Under this aggressive approach, grouping logic
109 is ready to dispatch a load instruction, such
dispatch to be enabled by a control signal which
indicates, at time t+1, whether a load instruction has
in fact completed. This aggressive approach requires
a more complex logic circuit than the conservative
approach.
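
The difference between the two approaches can be sketched in Python as follows. The 9-entry depth is taken from the description of load buffer 104; the function names, and the reduction of the buffer state to a simple occupancy count, are simplifying assumptions made for this sketch.

```python
from typing import Callable

LOAD_BUFFER_DEPTH = 9  # load buffer 104 is 9-deep


def conservative_can_dispatch(occupancy_at_t: int) -> bool:
    """Predict LB(t+1) == LB(t): assume no load completes during cycle t."""
    return occupancy_at_t < LOAD_BUFFER_DEPTH


def aggressive_prepare(occupancy_at_t: int) -> Callable[[bool], bool]:
    """Precompute the decision for both outcomes of LB(t+1).

    The returned function is the late select: it takes the control
    signal available at t+1 (True if a load in fact completed) and
    returns whether a load instruction may be dispatched.
    """
    if_same = occupancy_at_t < LOAD_BUFFER_DEPTH
    if_advanced = max(occupancy_at_t - 1, 0) < LOAD_BUFFER_DEPTH
    return lambda load_completed: if_advanced if load_completed else if_same


# With a full buffer the conservative approach always waits a cycle,
# while the aggressive approach dispatches if the completion signal arrives.
select = aggressive_prepare(9)
```

The extra precomputed outcome and the late select are a small-scale picture of why the aggressive approach needs more logic than the conservative one.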

Thus, the skilled person would appreciate that 
state S(t+1) of CPU 100 can be predicted from state 
S(t). Consequently, both the number of instructions 
and the types of instructions that can be dispatched at 
time t+1 (i.e. the instruction group at time t+1) based 
on predicted state S(t+1) can be derived, at time t, 
from state S(t), subject to additional handling based 
on the actual state S_A(t+1) at time t+1.

The above analysis can be extended to allow
state S(t+r) at time t+r to be derived from state S(t)
at time t. The instruction group at time t+r can be
derived at time t, provided that, for each
instruction group between time t and t+r, all
instructions from that instruction group are
dispatched before any instruction from a subsequent
instruction group is allowed to be dispatched (i.e. no
instruction group merging).
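
For a single pipelined functional unit, the extension from one cycle to r cycles amounts to applying the one-cycle shift r times. The Python sketch below assumes, for illustration, that the instructions entering the unit in the intervening cycles are already known because their groups are in the grouping pipeline and cannot be merged or reordered; the name predict and the list representation are hypothetical.

```python
from typing import List, Optional, Tuple

Entry = Optional[Tuple[str, str, str]]  # (rs1, rs2, rd), or None if empty


def predict(stages: List[Entry], issues: List[Entry]) -> List[Entry]:
    """Return the pipeline state r = len(issues) cycles after time t.

    stages is the state at time t (index 0 is stage 1). issues[j] is the
    instruction entering stage 1 in the step from t+j to t+j+1, or None
    if nothing is issued to this unit in that step.
    """
    state = list(stages)
    for issued in issues:
        state = [issued] + state[:-1]   # one application of the shift rule
    return state


# Two cycles ahead for a 4-stage unit, with one instruction issued in
# the second step.
a, b, c, d = ("f0", "f1", "f2"), ("f3", "f4", "f5"), ("f6", "f7", "f8"), ("f9", "f10", "f11")
future = predict([a, b, None, c], [None, d])
```
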

Since instructions from different instruction 
groups are not merged, intra-group dependencies and 
inter-group dependencies can be checked in parallel.
The instructions are fetched from either an instruction
cache or an instruction buffer. An instruction buffer 
is preferable in a system in which not all accesses 
(e.g. branch instructions) to the instruction cache are 
aligned, and multiple entry points in the basic blocks 
of a program are allowed. 

Once four candidate instructions for an
instruction group are identified, intra-group data
dependency checking can begin. Because of the
constraint against instruction group merging described
above, i.e., all instructions in an instruction group
must be dispatched before an instruction from a
subsequent instruction group can be dispatched,
intra-group dependency checking can be accomplished in a
pipelined fashion. That is, intra-group dependency
checking can span more than one processor cycle and all
intra-group dependency checking can occur independently
of inter-group dependency checking. For the purpose of
intra-group dependency check, each instruction group
can be represented by:

IntraS(t) = {rs1_i(t), rs2_i(t), rd_i(t), res_i(t)}
            for 0 <= i <= W-1

where W is the width of the machine, and res_i
represents the resource utilization of instruction i.
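
An intra-group read-after-write check over such a representation can be sketched in Python as below. The resource field res_i is omitted, and the function name and the tuple layout (rs1, rs2, rd) are assumptions made for illustration; the check compares each instruction only against earlier instructions of the same group, which is what allows it to proceed independently of the inter-group checks.

```python
from typing import List, Optional, Set, Tuple

GroupEntry = Tuple[Optional[str], Optional[str], Optional[str]]  # (rs1, rs2, rd)


def intra_group_raw(group: List[GroupEntry]) -> Set[int]:
    """Return the indices of instructions that read a register written by
    an earlier instruction in the same group."""
    dependent: Set[int] = set()
    for j, (rs1, rs2, _) in enumerate(group):
        for i in range(j):
            rd = group[i][2]
            if rd is not None and rd in (rs1, rs2):
                dependent.add(j)
                break
    return dependent


# A group of width W = 4: instruction 1 reads r3 written by instruction 0,
# and instruction 3 reads r5 written by instruction 1.
group = [("r1", "r2", "r3"), ("r3", "r4", "r5"), ("r6", "r7", "r8"), ("r5", "r8", "r9")]
deps = intra_group_raw(group)
```
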
An example of a four-stage pipeline 200 is shown in
Figure 2. In Figure 2, at first stage 201, as soon as
the instruction group is constituted, intra-group
dependency checking is performed. Thereafter, at
stage 202, resource allocation within the instruction
group can be determined. At stage 203, inter-group
decisions, e.g. resource allocation decisions taking
into consideration resource allocation in previous
instruction groups, are merged with the decisions at
stages 201 and 202. For example, if the present
instruction group includes an instruction designated
for floating point divider 108, stage 203 would have
determined by this time whether a previous instruction
using floating point divider 108 would have completed
by the time the present instruction group is due to be
dispatched. Finally, at stage 204, non-deterministic
conditions, e.g. the condition at store buffer 105,
are considered. Dispatchable instructions are issued
into CPU 100 at the end of stage 204.
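
The timing behavior of such a four-stage grouping pipeline can be illustrated with a small Python simulation. It models only the movement of instruction groups through stages 201 to 204, not the checks performed in each stage; the function name run and the assumption that one new group enters per processor cycle are illustrative.

```python
from typing import List, Optional, Tuple

STAGES = (
    "201: intra-group dependency check",
    "202: resource allocation within the group",
    "203: merge of inter-group decisions",
    "204: non-deterministic conditions",
)


def run(groups: List[str]) -> List[Tuple[int, str]]:
    """Feed one group per cycle into the pipeline and return
    (cycle, group) pairs for each group issued at the end of stage 204."""
    pipe: List[Optional[str]] = [None] * len(STAGES)
    pending = list(groups)
    dispatched: List[Tuple[int, str]] = []
    cycle = 0
    while pending or any(g is not None for g in pipe):
        incoming = pending.pop(0) if pending else None
        pipe = [incoming] + pipe[:-1]          # every group moves one stage
        if pipe[-1] is not None:
            dispatched.append((cycle, pipe[-1]))
        cycle += 1
    return dispatched


events = run(["G0", "G1", "G2"])
```

In this model each group is issued three cycles after it enters stage 201, yet one group still leaves the pipeline per cycle, so the grouping work is spread over four processor cycles without reducing dispatch throughput.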

The above detailed description is provided to 
illustrate the specific embodiments of the present
invention and is not intended to be limiting. Numerous
variations and modifications within the scope of the
present invention are possible. The present invention
is defined by the following claims.